Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Identifieur interne : 000E30 ( Main/Exploration ); précédent : 000E29; suivant : 000E31

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Auteurs : Yaron Orenstein [États-Unis] ; David Pellow [Israël] ; Guillaume Marçais [États-Unis] ; Ron Shamir [Israël] ; Carl Kingsford [États-Unis]

Source :

RBID : PMC:5645146

Descripteurs français

English descriptors

Abstract

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.


Url:
DOI: 10.1371/journal.pcbi.1005777
PubMed: 28968408
PubMed Central: 5645146


Affiliations:


Links toward previous steps (curation, corpus...)


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts</wicri:regionArea>
<wicri:noRegion>Massasschusetts</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
<wicri:noRegion>Tel-Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
<wicri:noRegion>Tel-Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28968408</idno>
<idno type="pmc">5645146</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645146</idno>
<idno type="RBID">PMC:5645146</idno>
<idno type="doi">10.1371/journal.pcbi.1005777</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000F86</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F86</idno>
<idno type="wicri:Area/Pmc/Curation">000F86</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000F86</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000888</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000888</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:28968408</idno>
<idno type="wicri:Area/PubMed/Corpus">000B33</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">000B33</idno>
<idno type="wicri:Area/PubMed/Curation">000B33</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">000B33</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000D43</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">000D43</idno>
<idno type="wicri:Area/Ncbi/Merge">001C00</idno>
<idno type="wicri:Area/Ncbi/Curation">001C00</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">001C00</idno>
<idno type="wicri:doubleKey">1553-734X:2017:Orenstein Y:designing:small:universal</idno>
<idno type="wicri:Area/Main/Merge">000E33</idno>
<idno type="wicri:Area/Main/Curation">000E30</idno>
<idno type="wicri:Area/Main/Exploration">000E30</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Designing small universal
<italic>k</italic>
-mer hitting sets for improved analysis of high-throughput sequencing</title>
<author>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massasschusetts</wicri:regionArea>
<wicri:noRegion>Massasschusetts</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
<wicri:noRegion>Tel-Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
<affiliation wicri:level="4">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel</addr-line>
</nlm:aff>
<country xml:lang="fr">Israël</country>
<wicri:regionArea>Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv</wicri:regionArea>
<wicri:noRegion>Tel-Aviv</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<affiliation wicri:level="4">
<nlm:aff id="aff003">
<addr-line>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania</wicri:regionArea>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Caenorhabditis elegans (genetics)</term>
<term>Computational Biology (methods)</term>
<term>Computer Heuristics</term>
<term>Genome, Bacterial</term>
<term>Genome, Human</term>
<term>High-Throughput Nucleotide Sequencing (methods)</term>
<term>Humans</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Animaux</term>
<term>Biologie informatique ()</term>
<term>Caenorhabditis elegans (génétique)</term>
<term>Génome bactérien</term>
<term>Génome humain</term>
<term>Heuristique informatique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Séquençage nucléotidique à haut débit ()</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Caenorhabditis elegans</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Caenorhabditis elegans</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Computational Biology</term>
<term>High-Throughput Nucleotide Sequencing</term>
<term>Sequence Analysis, DNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Computer Heuristics</term>
<term>Genome, Bacterial</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Animaux</term>
<term>Biologie informatique</term>
<term>Génome bactérien</term>
<term>Génome humain</term>
<term>Heuristique informatique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Séquençage nucléotidique à haut débit</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers
<italic>k</italic>
and
<italic>L</italic>
>
<italic>k</italic>
, we say that a set of
<italic>k</italic>
-mers is a
<italic>universal hitting set</italic>
(UHS) if every possible
<italic>L</italic>
-long sequence must contain a
<italic>k</italic>
-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of
<italic>k</italic>
and
<italic>L</italic>
and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, M" uniqKey="Roberts M">M Roberts</name>
</author>
<author>
<name sortKey="Hayes, W" uniqKey="Hayes W">W Hayes</name>
</author>
<author>
<name sortKey="Hunt, Br" uniqKey="Hunt B">BR Hunt</name>
</author>
<author>
<name sortKey="Mount, Sm" uniqKey="Mount S">SM Mount</name>
</author>
<author>
<name sortKey="Yorke, Ja" uniqKey="Yorke J">JA Yorke</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solomon, B" uniqKey="Solomon B">B Solomon</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deorowicz, S" uniqKey="Deorowicz S">S Deorowicz</name>
</author>
<author>
<name sortKey="Kokot, M" uniqKey="Kokot M">M Kokot</name>
</author>
<author>
<name sortKey="Grabowski, S" uniqKey="Grabowski S">S Grabowski</name>
</author>
<author>
<name sortKey="Debudaj Grabysz, A" uniqKey="Debudaj Grabysz A">A Debudaj-Grabysz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Limasset, A" uniqKey="Limasset A">A Limasset</name>
</author>
<author>
<name sortKey="Jackman, S" uniqKey="Jackman S">S Jackman</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Medvedev, P" uniqKey="Medvedev P">P Medvedev</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ye, C" uniqKey="Ye C">C Ye</name>
</author>
<author>
<name sortKey="Ma, Zs" uniqKey="Ma Z">ZS Ma</name>
</author>
<author>
<name sortKey="Cannon, Ch" uniqKey="Cannon C">CH Cannon</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Douglas, Wy" uniqKey="Douglas W">WY Douglas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hach, F" uniqKey="Hach F">F Hach</name>
</author>
<author>
<name sortKey="Numanagi, I" uniqKey="Numanagi I">I Numanagić</name>
</author>
<author>
<name sortKey="Alkan, C" uniqKey="Alkan C">C Alkan</name>
</author>
<author>
<name sortKey="Sahinalp, Sc" uniqKey="Sahinalp S">SC Sahinalp</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Champarnaud, Jm" uniqKey="Champarnaud J">JM Champarnaud</name>
</author>
<author>
<name sortKey="Hansel, G" uniqKey="Hansel G">G Hansel</name>
</author>
<author>
<name sortKey="Perrin, D" uniqKey="Perrin D">D Perrin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mykkeltveit, J" uniqKey="Mykkeltveit J">J Mykkeltveit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chvatal, V" uniqKey="Chvatal V">V Chvatal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rhoads, A" uniqKey="Rhoads A">A Rhoads</name>
</author>
<author>
<name sortKey="Au, Kf" uniqKey="Au K">KF Au</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Branton, D" uniqKey="Branton D">D Branton</name>
</author>
<author>
<name sortKey="Deamer, Dw" uniqKey="Deamer D">DW Deamer</name>
</author>
<author>
<name sortKey="Marziali, A" uniqKey="Marziali A">A Marziali</name>
</author>
<author>
<name sortKey="Bayley, H" uniqKey="Bayley H">H Bayley</name>
</author>
<author>
<name sortKey="Benner, Sa" uniqKey="Benner S">SA Benner</name>
</author>
<author>
<name sortKey="Butler, T" uniqKey="Butler T">T Butler</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations>
<list>
<country>
<li>Israël</li>
<li>États-Unis</li>
</country>
<region>
<li>Pennsylvanie</li>
</region>
<settlement>
<li>Pittsburgh</li>
</settlement>
<orgName>
<li>Université Carnegie-Mellon</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<noRegion>
<name sortKey="Orenstein, Yaron" sort="Orenstein, Yaron" uniqKey="Orenstein Y" first="Yaron" last="Orenstein">Yaron Orenstein</name>
</noRegion>
<name sortKey="Kingsford, Carl" sort="Kingsford, Carl" uniqKey="Kingsford C" first="Carl" last="Kingsford">Carl Kingsford</name>
<name sortKey="Marcais, Guillaume" sort="Marcais, Guillaume" uniqKey="Marcais G" first="Guillaume" last="Marçais">Guillaume Marçais</name>
</country>
<country name="Israël">
<noRegion>
<name sortKey="Pellow, David" sort="Pellow, David" uniqKey="Pellow D" first="David" last="Pellow">David Pellow</name>
</noRegion>
<name sortKey="Shamir, Ron" sort="Shamir, Ron" uniqKey="Shamir R" first="Ron" last="Shamir">Ron Shamir</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000E30 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000E30 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:5645146
   |texte=   Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:28968408" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021